A Brief Introduction to Supervised Learning

Michel Lang

Image source: https://www.mathworks.com/discovery/reinforcement-learning.html

  • Unsupervised Learning: learn with unlabeled data
  • Supervised Learning: learn with labeled data
  • Reinforcement Learning: learn through reward maximization

We will focus on two subcategories of supervised learning: classification and regression

Data Structure

  • Labels are categorical for classification
  • Labels are continuous (numeric) for regression
  • Example “Pima Indian Diabetes”: label “pos” vs. “neg”
diabetes  age  mass  pressure  pregnant
pos       50   33.6  72        6
neg       31   26.6  66        1
pos       32   23.3  64        8
neg       21   28.1  66        1
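Such a data set is just a collection of labeled observations. A minimal Python sketch holding the four rows above:

```python
# The four "Pima Indian Diabetes" rows from the table above: each observation
# pairs a categorical label ("pos"/"neg") with numeric features.
data = [
    {"diabetes": "pos", "age": 50, "mass": 33.6, "pressure": 72, "pregnant": 6},
    {"diabetes": "neg", "age": 31, "mass": 26.6, "pressure": 66, "pregnant": 1},
    {"diabetes": "pos", "age": 32, "mass": 23.3, "pressure": 64, "pregnant": 8},
    {"diabetes": "neg", "age": 21, "mass": 28.1, "pressure": 66, "pregnant": 1},
]

# Categorical labels -> this is a classification task.
labels = [row["diabetes"] for row in data]
```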

Classification

  • Goal: predict label for new observations accurately based only on their features
  • In other words: find a good decision rule to discriminate between label categories
  • Here: simple linear decision rule. Predictions for new observations are determined by side of (hyper-) plane

Regression

  • Goal: predict response for new observations based on their features
  • In other words: find a good function approximation
  • Here: simple polynomial function
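As a minimal sketch of function approximation, here is closed-form ordinary least squares for the simplest polynomial, a straight line (the data below is made up for illustration):

```python
def fit_line(xs, ys):
    """Closed-form ordinary least squares for f(x) = a*x + b (a degree-1 polynomial)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx  # slope and intercept

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data generated by y = 2x + 1
```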

Terminology

\(\mathcal{D}\) is data set of \(n\) observations \(\left( \mathbf{x}^{(i)}, y^{(i)} \right)\), \(i = 1, \ldots, n\)

  • Feature vector \(\mathbf{x}^{(i)} \in \mathcal{X}\), e.g. \(\mathcal{X} \equiv \mathbb{R}^p\) for \(p\) numeric features
  • Label \(y^{(i)} \in \mathcal{Y}\); for classification, \(|\mathcal{Y}| = g\) is the number of label categories, and \(g \equiv 1\) for regression

Terminology

A function which assigns a prediction to a feature vector is called an ML model: \[f: \mathcal{X} \to \mathbb{R}^g\] \[f \left(\mathbf{x}^{(i)} \right) =: \hat{y}^{(i)}\]

  • \(f\) yields a probability (or score) for each of the \(g\) categories
  • What are desired properties of \(f\)?
  • How to find a “good” ML model \(f\)?
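To make the signature \(f: \mathcal{X} \to \mathbb{R}^g\) concrete, here is a toy model for \(g = 2\). The sigmoid on the first feature is a hand-picked rule for illustration, not a fitted model:

```python
import math

def f(x):
    """A toy model f: X -> R^g with g = 2: one probability per category.
    The sigmoid on the first feature (threshold 30) is arbitrary and
    hand-picked -- it only illustrates the signature of f."""
    p_pos = 1.0 / (1.0 + math.exp(-(x[0] - 30.0)))
    return {"pos": p_pos, "neg": 1.0 - p_pos}

f([50.0])  # one score per category, summing to 1
```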

Terminology

The generalization error is approximated by the empirical risk, which we minimize (empirical risk minimization): \[\mathcal{R}_{\text{emp}}(f) := \frac{1}{n} \sum_{i=1}^n L \left( y^{(i)}, f(\mathbf{x}^{(i)}) \right) \]

  • \(L\) is a loss function with \(L: \mathcal{Y} \times \mathbb{R}^g \to \mathbb{R}\)
  • Popular choices for \(L\):
    • Zero-one loss (the counterpart of accuracy, since we minimize): \(L(c_1, c_2) = \mathcal{I}(c_1 \neq c_2)\)
    • Squared error (averaging yields the mean squared error): \(L(y_1, y_2) = \left( y_1 - y_2 \right)^2\)
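The empirical risk is just an average of per-observation losses. A small sketch in plain Python; note that to *minimize* a risk, accuracy is used in its loss form (zero-one / misclassification loss):

```python
def empirical_risk(ys, preds, loss):
    """R_emp = (1/n) * sum of L(y_i, yhat_i) over all n observations."""
    return sum(loss(y, p) for y, p in zip(ys, preds)) / len(ys)

def zero_one(y, p):
    return 0.0 if y == p else 1.0   # misclassification loss

def squared(y, p):
    return (y - p) ** 2             # squared error loss

risk_cls = empirical_risk(["pos", "neg", "pos"], ["pos", "pos", "pos"], zero_one)
risk_reg = empirical_risk([1.0, 2.0], [1.0, 3.0], squared)
```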

Underfitting

The function \(f\) is not capable of capturing the true relationship between features and labels.

Underfitting can be detected by poor model performance on both training and test data.

Overfitting

The function \(f\) starts modeling the noise.

Overfitting can be detected by poor model performance on test data only. On the training data, performance can be made arbitrarily good by simply memorizing the data.
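An extreme overfitter can be sketched as a pure lookup table over the training data (rows reused from the diabetes example for illustration): training error is zero by construction, yet nothing transferable has been learned.

```python
train_x = [(50, 33.6), (31, 26.6), (32, 23.3), (21, 28.1)]
train_y = ["pos", "neg", "pos", "neg"]
lookup = dict(zip(train_x, train_y))

def memorizer(x):
    """An extreme overfitter: perfect on training data by pure memorization,
    with an arbitrary fallback ("neg") for anything unseen."""
    return lookup.get(x, "neg")

# Zero training error -- but predictions for unseen points are arbitrary.
train_error = sum(memorizer(x) != y for x, y in zip(train_x, train_y)) / len(train_y)
```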

Evaluation

Important: You must evaluate on new, unseen observations!

Evaluation on independent test data
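A common way to obtain independent test data is to randomly hold out a fraction of the observations before training. A minimal sketch (the 25% test fraction and fixed seed are illustrative choices):

```python
import random

def train_test_split(data, test_fraction=0.25, seed=1):
    """Randomly hold out a fraction of the observations for evaluation."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    idx = list(range(len(data)))
    rng.shuffle(idx)
    cut = round(len(data) * (1 - test_fraction))
    return [data[i] for i in idx[:cut]], [data[i] for i in idx[cut:]]

train, test = train_test_split(list(range(8)))  # 6 train / 2 test observations
```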

Model Selection

The choice of the model depends on:

  1. generalization performance
  2. runtime and available hardware
  3. experience and familiarity of the machine learner
  4. interpretability of the resulting model
  5. other constraints of the application

In data mining competitions, we ultimately optimize for (1), but all the other points are important to consider.

Model Selection

Suggested procedure:

  1. Start with some baselines, e.g. constantly predicting the majority class
  2. Compare to simple and interpretable ML models like logistic regression, classification trees, or handcrafted rules
  3. Analyze the models, try to improve them via feature engineering
  4. Compare to more complex models (with engineered features) and start hyperparameter tuning
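Step 1 of the procedure, the majority-class baseline, takes only a few lines:

```python
from collections import Counter

def majority_baseline(train_labels):
    """A baseline model that constantly predicts the majority class,
    ignoring the features entirely."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda x: majority

predict = majority_baseline(["neg", "neg", "neg", "pos"])
predict([50, 33.6])  # "neg", regardless of the features
```

Any model worth keeping should clearly beat this baseline on the test data.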

Know your Models

  • The “no free lunch” theorem holds: no single model works well on all data sets.
  • You usually compare multiple competitors and take the “best”.
  • It is important to understand the strengths of models and how to work around their weaknesses
  • Example: Circle data set and linear decision rule

Circle Example

Define \(x_3 := x_1^2 + x_2^2\). In this engineered feature, the circle data becomes linearly separable.
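The transformation can be sketched directly; for points labeled by the unit circle, a single threshold on \(x_3\), i.e. a hyperplane in the new feature space, separates the two classes:

```python
def add_radius_feature(x1, x2):
    """Append the engineered feature x3 = x1^2 + x2^2 (squared distance to origin)."""
    return (x1, x2, x1 ** 2 + x2 ** 2)

# A threshold on x3 alone -- a hyperplane in the new 3-dimensional feature
# space -- now separates inside from outside the circle.
add_radius_feature(0.6, 0.0)  # x3 = 0.36: inside the unit circle
add_radius_feature(1.0, 1.0)  # x3 = 2.0:  outside the unit circle
```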

Resources